String Covering: A Survey
The study of strings is an important combinatorial field that precedes the
digital computer. Strings can be very long, trillions of letters, so it is
important to find compact representations. Here we first survey various forms
of one potential compaction methodology, the cover of a given string x,
initially proposed in a simple form in 1990, but increasingly of interest as
more sophisticated variants have been discovered. We then consider covering by
a seed; that is, a cover of a superstring of x. We conclude with many proposals
for research directions that could make significant contributions to string
processing in the future.
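To make the notion of a cover concrete: a string u covers x when every position of x lies inside some occurrence of u. The following is a minimal sketch of that definition using a naive quadratic-time check; it is not one of the algorithms surveyed, and the function names `covers` and `shortest_cover` are illustrative assumptions.

```python
def covers(u: str, x: str) -> bool:
    """Return True if every position of x lies within some occurrence of u."""
    m = len(u)
    if m == 0 or m > len(x):
        return False
    covered_up_to = 0          # x[0:covered_up_to] is covered so far
    start = x.find(u)
    while start != -1:
        if start > covered_up_to:
            return False       # gap between consecutive occurrences
        covered_up_to = start + m
        start = x.find(u, start + 1)
    return covered_up_to == len(x)


def shortest_cover(x: str) -> str:
    """Return the shortest prefix of x that covers x (x itself at worst)."""
    for k in range(1, len(x) + 1):
        if covers(x[:k], x):
            return x[:k]
    return x


print(shortest_cover("abaababaaba"))
```

Only prefixes are tried because any cover of x must occur at position 0 of x.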
Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order
V-order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words. The fundamental V-comparison of strings can be done in linear time and constant space. V-order has been proposed as an alternative to lexicographic order (lexorder) in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform (BWT). In line with the recent interest in the connection between suffix arrays and Lyndon factorization, in this paper we obtain similar results for the V-order factorization. Indeed, we show that the results describing the connection between suffix arrays and Lyndon factorization are matched by analogous V-order processing. We also describe a methodology for efficiently computing the FM-index in V-order, as well as V-order substring pattern matching using backward search.
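For orientation, backward search over a BWT can be sketched in ordinary lexicographic order. This sketch does not implement V-order or the paper's method: it builds the suffix array by naive sorting and counts occurrences by scanning, and it assumes the text contains no `$` and that every text character sorts above `$`. All names are illustrative.

```python
from collections import Counter


def backward_search(text: str, pattern: str) -> list[int]:
    """Return sorted start positions of pattern in text via BWT backward search."""
    s = text + "$"                      # sentinel, assumed smaller than all letters
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive suffix array
    bwt = "".join(s[i - 1] for i in sa)               # s[-1] is the sentinel

    # C[c] = number of characters in s that are strictly smaller than c
    counts = Counter(s)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:
        """Occurrences of c in bwt[:i] (naive scan; an FM-index stores these)."""
        return bwt[:i].count(c)

    lo, hi = 0, len(s)                  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(backward_search("banana", "ana"))
```

The pattern is consumed right to left, and each step narrows the interval of suffixes that begin with the part of the pattern processed so far.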
Practical KMP/BM Style Pattern-Matching on Indeterminate Strings
In this paper we describe two simple, fast, space-efficient algorithms for
finding all matches of an indeterminate pattern in an indeterminate string,
where both the pattern and the string are defined on a "small" ordered
alphabet. Both algorithms depend on a preprocessing phase that replaces this
alphabet by an integer alphabet and (reversibly, in time linear in string
length) maps both the string and the pattern into equivalent regular strings
on the integer alphabet, whose maximum (indeterminate) letter can be expressed
in a 32-bit word (for sufficiently small alphabets, such as that of DNA
sequences, an 8-bit representation suffices). We first describe an efficient
version, KMP Indet, of the venerable Knuth-Morris-Pratt algorithm to find all
occurrences of the transformed pattern in the transformed string (that is, of
the original pattern in the original string), but, whenever necessary, using
the prefix array, rather than the border array, to control shifts of the
transformed pattern along the transformed string. We go on to describe a
similar efficient version, BM Indet, of the Boyer-Moore algorithm that turns
out to execute significantly faster than KMP Indet over a wide range of test
cases. A noteworthy feature is that both algorithms require very little
additional space, counted in machine words. We conjecture that a similar
approach may yield practical and efficient indeterminate equivalents to other
well-known pattern-matching algorithms, in particular the several variants of
Boyer-Moore.